fix tpd panic loop: cxo Finc-to-negative + httputil short-write discriminator #2422
Merged
0pcom merged 1 commit into skycoin:develop on May 4, 2026
Conversation
Production TPD was restarting every 30-40s (RestartCount 556 over 6h on the prod host) because two distinct panics tear down the process:

1. pkg/cxo/skyobject/cache.go (*Cache).Finc:1189,1216 — panic: "Finc to negative for: <hash>"

   The filling-refcount went below zero — likely a duplicate Finc on a Filler.incs map, or an Inc/Finc mismatch across overlapping fillers. Hard process kill via panic. Filler.apply / Filler.reject already consume Finc's error return and just log; surface the inconsistency through that path instead. Clamp fc to 0, log the condition with key and the offending inc, and continue. Worst case is a leaked filling-item slot — orders of magnitude better than killing the service.

2. pkg/httputil/httputil.go WriteJSON:50 — panic: "short write: i/o deadline reached"

   isIOError checks errors.Is(err, io.ErrShortWrite), but net/http's timeoutWriter returns its own error value with the same message string when a write deadline expires mid-response. errors.Is misses it (different sentinel), and the fallback string match didn't include "short write", so getAllTransports' ~1MB JSON write to a slow client panics on every deadline hit. Added "short write", "i/o timeout", and "deadline exceeded" to the string-match fallback. New TestIsIOErrorShortWriteVariants pins all sentinel and string-match cases.

Neither bug is caused by skycoin#2415/skycoin#2418/skycoin#2421; they were just made visible because deploys cycling at 30-40s aren't subtle. Together these stop the panic loop without changing any data semantics.
0pcom added a commit that referenced this pull request on May 4, 2026
(#2423) #2422 silenced the two dominant TPD panics ('Finc to negative', 'short write: i/o deadline reached'). RestartCount went from ~556/6h to 2/10min. The remaining two panics are nil-derefs in pkg/cxo/node/head.go from torn-down fillHead state:

1. createFiller line 338: cr.c.String() panics when cr.c is nil. handleDelConn (line 517) sets f.p.c = nil when the source connection is removed. f.p.r remains non-zero, so the f.p == (connRoot{}) gate at handleFillingResult does NOT catch the nil-c case, and createFiller(f.p) runs with a nil source.

   Fix: nil-c short-circuit at the top of createFiller. The fill can't proceed without a source connection — log and return.

2. handleSuccess line 228 (and the parallel sites in handleRequest, handleRequestFailure, handleReceivedRoot): f.fc.PushBack on a nil *list.List. closeFiller (line 384) sets `f.rqo, f.fc, f.rq = nil, nil, nil`, so any in-flight Filler goroutine messages that drain after close land on a nil list.

   Fix: nil-guard each PushBack/PushFront site (4 sites total). Same recovery as #2422's panic→drop pattern: the closed filler's work is no longer relevant.

These are the actual races; the right "real" fix would be to drain or signal-shutdown the channels before nilling state, but that's a larger structural change. Guards stop the process kill while preserving observability.
Summary
Production TPD has been restarting every 30-40 seconds (RestartCount: 556 over 6 hours on the prod host running develop tip 96ee3806c). Two distinct panics, found by grepping the container logs, are tearing down the process:

1. pkg/cxo/skyobject/cache.go — (*Cache).Finc to negative

Stack:
The filling-refcount goes below zero — likely a duplicate Finc on a Filler.incs map, or an Inc/Finc mismatch across overlapping fillers. The historical behavior was to panic, which became visible after #2420 increased the rate of fill churn (or the bug was always latent and load surfaced it). Filler.apply and Filler.reject already consume Finc's error return and just log:

So surfacing the inconsistency through that path costs nothing extra. The fix clamps fc to 0, logs the offending key + inc, and continues. Worst case is a leaked filling-item slot — orders of magnitude better than a process kill every 30 seconds.
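The clamp can be sketched like this. The fillingItem type, its fc field, and the map lookup are stand-ins, not the real pkg/cxo/skyobject internals; only the clamp-log-return pattern mirrors the fix.

```go
package main

import (
	"fmt"
	"log"
)

// fillingItem is a stand-in for the cache's per-key filling state.
type fillingItem struct {
	fc int // filling refcount
}

// finc decrements the filling refcount. Instead of panicking when the
// count would go negative (the old behavior), it clamps to zero, logs
// the key and offending inc, and returns an error that callers such as
// Filler.apply / Filler.reject already consume and log.
func finc(items map[string]*fillingItem, key string, inc int) error {
	it, ok := items[key]
	if !ok {
		return fmt.Errorf("finc: no filling item for %s", key)
	}
	it.fc -= inc
	if it.fc < 0 {
		log.Printf("finc to negative for %s (inc=%d); clamping to 0", key, inc)
		it.fc = 0 // worst case: a leaked filling-item slot, not a process kill
		return fmt.Errorf("finc to negative for %s", key)
	}
	return nil
}
```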
2. pkg/httputil/httputil.go — WriteJSON panic on slow-client timeout

Stack:
getAllTransports writes ~1MB JSON. On a slow client, net/http's per-write deadline expires mid-response, and the response writer returns an error whose message is "short write: i/o deadline reached". The body is io.ErrShortWrite's text, but the value is net/http.timeoutWriter's own error — errors.Is(err, io.ErrShortWrite) returns false. The fallback string match in isIOError only checked "broken pipe" and "connection reset", so the discriminator returned false and WriteJSON ran the panic branch.

Added "short write", "i/o timeout", and "deadline exceeded" to the string-match fallback. TestIsIOErrorShortWriteVariants pins both sentinel and string-match cases.
Test plan

- go build ./... clean.
- go vet ./pkg/cxo/... ./pkg/httputil/... clean.
- go test ./pkg/httputil/ ./pkg/cxo/skyobject/ ./pkg/transport-discovery/... all pass.
- gofmt clean.
- TestIsIOErrorShortWriteVariants covers nil, all four current sentinels, the production short-write case, the i/o-timeout / deadline-exceeded text variants, and the existing broken-pipe / connection-reset cases.
- RestartCount no longer climbing; docker logs transport-discovery | grep -c panic: reaches 0 within a few minutes.

Notes
Neither bug is caused by the recent latency PRs (#2415, #2418, #2421); the panics' stacks make that explicit. They've just been hammering the deployment more visibly since latency persistence and outlier filtering landed back-to-back, because every TPD bounce now drops registration TTLs on visors that haven't re-pushed yet. Fixing these two restores normal TPD uptime.